Optimize MongoDBExportPartitionSupplier for uniform _id type collections by dinujoh · Pull Request #6910 · opensearch-project/data-prepper

dinujoh · 2026-06-07T11:54:19Z

Description

For collections with uniform _id types, replace the $or query with a simple Filters.gt("_id", value) for finding partition boundaries. This allows DocumentDB to use a single B-tree index seek instead of multi-index scan.

Changes:

Add isUniformIdType() that checks first/last doc _id types
Add buildNextStartFilter() with simple $gt for uniform types, falling back to $or-based query for mixed types
Use fresh Filters.gte() + skip() per iteration for partition end
Extract addPartition() helper to reduce duplication
Make BsonHelper.isClassNumber() public for numeric type grouping

Performance: 14M docs (10GB) partitioned in ~30 seconds.
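
The boundary search described above can be illustrated without a database. The following is a minimal sketch, not the PR's code: it assumes a uniform `_id` type (here `Long`), models the `_id` index as a sorted in-memory set, and uses a hypothetical `PartitionBoundarySketch` class. The skip count of `partitionSize - 1` and the fallback to the last element are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;

final class PartitionBoundarySketch {

    /**
     * Splits a sorted set of ids into inclusive {start, end} pairs.
     * Hypothetical helper; the sorted set stands in for the _id index.
     */
    static List<long[]> partition(final NavigableSet<Long> ids, final int partitionSize) {
        if (partitionSize < 1) {
            throw new IllegalArgumentException("partitionSize must be >= 1");
        }
        final List<long[]> ranges = new ArrayList<>();
        Long start = ids.isEmpty() ? null : ids.first();
        while (start != null) {
            // Analogue of a fresh gte(start) query, sorted ascending, with skip():
            // jump ahead to the last id of this partition.
            // If the skip runs past the end, fall back to the last id overall.
            final Long end = ids.tailSet(start, true).stream()
                    .skip(partitionSize - 1L)
                    .findFirst()
                    .orElse(ids.last());
            ranges.add(new long[] {start, end});
            // Analogue of the simple gt(end) filter: one ordered seek
            // to the next id, with no per-type $or clauses.
            start = ids.higher(end);
        }
        return ranges;
    }
}
```

With ids 1 through 10 and a partition size of 4, this yields the ranges [1, 4], [5, 8] and [9, 10].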

Check List

New functionality includes testing.
New functionality has a documentation issue. Please link to it in this PR.
- New functionality has javadoc added
Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions · 2026-06-07T11:54:47Z

✅ License Header Check Passed

All newly added files have proper license headers. Great work! 🎉

dlvenable · 2026-06-08T14:01:16Z

+import static org.mockito.Mockito.when;
+
+@ExtendWith(MockitoExtension.class)
+public class MongoDBExportPartitionSupplierIsUniformIdTypeTest {


Following existing conventions, add an underscore for clarity: MongoDBExportPartitionSupplier_IsUniformTypeTest. Also, make this class package-private (remove the public modifier).

dlvenable · 2026-06-08T14:09:16Z

+     * If uniform, we can use a simple Filters.gt() instead of the complex $or query across all BSON types.
+     */
+    boolean isUniformIdType(final MongoCollection<Document> col) {
+        final Document first = col.find().projection(ID_PROJECTION).sort(ID_ASC).limit(1).first();


Can these two be combined to avoid two network calls?

dinujoh · 2026-06-09

DocumentDB doesn't support the $facet aggregation stage, so the first and last documents can't be fetched in one query. Both queries are indexed _id lookups (ascending limit 1, descending limit 1), and each takes <1 ms.
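
Why comparing only the first and last `_id` can be enough: assuming the index orders values by BSON type bracket first (as MongoDB's comparison order does), every document between two endpoints of the same bracket falls in that bracket too. A minimal sketch of the comparison follows. The class and method names are hypothetical, and `BigDecimal` stands in for `Decimal128`; all numeric types are treated as one group, in the spirit of the `BsonHelper.isClassNumber()` grouping mentioned in the description.

```java
final class UniformIdTypeSketch {

    /**
     * Maps a value to a coarse type group. All numbers share one group
     * because they are assumed to sort together in the index.
     */
    static String typeGroup(final Object id) {
        if (id instanceof Number) {
            return "number";
        }
        return id.getClass().getName();
    }

    /**
     * Returns true when the smallest and largest ids belong to the same
     * type group. A missing endpoint (empty collection) is treated as
     * non-uniform so the caller can fall back to the mixed-type path.
     */
    static boolean isUniform(final Object firstId, final Object lastId) {
        if (firstId == null || lastId == null) {
            return false;
        }
        return typeGroup(firstId).equals(typeGroup(lastId));
    }
}
```

Under these assumptions, an `Integer` first id and a `Long` last id count as uniform, while an `Integer` first id and a `String` last id do not.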

dlvenable · 2026-06-08T14:14:47Z

+                final Object gteValue = startDoc.get("_id");
+                final String gteClassName = gteValue.getClass().getName();
+
+                final Document endDoc = col.find(Filters.gte("_id", gteValue))


Maybe name this endOfPageDoc or something similar for clarity.

dlvenable · 2026-06-08T14:15:50Z

+                .thenReturn(new Document("_id", 3.14))
+                .thenReturn(new Document("_id", Decimal128.parse("99.99")));
+        assertThat(supplier.isUniformIdType(collection), is(true));
+    }


Maybe also include a test case for a real number type like double and an integer type as well.

dlvenable · 2026-06-08T14:27:23Z

+
+            // isUniformIdType: col.find() called twice (first asc, last desc)
+            // then col.find() for last doc when endDoc is null
+            when(col.find()).thenReturn(uniformCheckFirst, uniformCheckLast, lastDocIterable);


It would be better to use thenAnswer and inspect the invocation to decide which iterable to return. As written, the stub couples the order of return values here to the order of calls in the implementation, a coupling that need not exist.
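
The coupling described in this comment can be shown with hand-rolled stubs in plain Java. This is a sketch only and does not use Mockito: `BoundaryLookup` and both factory methods are hypothetical names. In Mockito itself, the input-driven variant would typically be written with `thenAnswer` and `invocation.getArgument(0)`.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class StubDispatchSketch {

    /** Hypothetical stand-in for "fetch the boundary _id for a sort direction". */
    interface BoundaryLookup {
        String boundaryId(int sortDirection); // 1 = ascending, -1 = descending
    }

    /**
     * Order-coupled stub: hands out canned values in call order and ignores
     * the argument, so the test breaks if the implementation reorders calls.
     */
    static BoundaryLookup sequential(final String... values) {
        final Deque<String> queue = new ArrayDeque<>(Arrays.asList(values));
        return direction -> queue.removeFirst();
    }

    /**
     * Input-driven stub: picks the value from the argument, so the order
     * in which the implementation makes its calls no longer matters.
     */
    static BoundaryLookup byDirection(final String firstId, final String lastId) {
        return direction -> direction > 0 ? firstId : lastId;
    }
}
```

If the implementation asked for the descending boundary first, the sequential stub would return the value intended for the ascending call, while the input-driven stub would still return the right one.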

dlvenable

Thanks @dinujoh for this contribution! This looks like a good performance improvement.

For collections with uniform _id types, replace the 8-clause $or query with a simple Filters.gt("_id", value) for finding partition boundaries. This allows DocumentDB to use a single B-tree index seek instead of multi-index scan.

Changes:
- Add isUniformIdType() that checks first/last doc _id types
- Add buildNextStartFilter() with simple $gt for uniform types, falling back to $or-based query for mixed types
- Use fresh Filters.gte() + skip() per iteration for partition end
- Extract addPartition() helper to reduce duplication
- Make BsonHelper.isClassNumber() public for numeric type grouping

Performance: 14M docs (10GB) partitioned in ~30 seconds.

Signed-off-by: Dinu John <86094133+dinujoh@users.noreply.github.com>

dinujoh requested review from KarstenSchnitter, Zhangxunmt, divbok, dlvenable, graytaylor0, kkondaka, oeyh, san81, sb2k16, srikanthjg and srikanthpadakanti as code owners June 7, 2026 11:54

dinujoh force-pushed the main branch from e218214 to 792e5bb Compare June 7, 2026 23:14

dlvenable reviewed Jun 8, 2026

dinujoh force-pushed the main branch from 792e5bb to 4bb88ce Compare June 9, 2026 15:43

dinujoh force-pushed the main branch from 4bb88ce to f6f9983 Compare June 9, 2026 16:11

dlvenable approved these changes Jun 9, 2026

sb2k16 approved these changes Jun 9, 2026

dinujoh merged commit 2bd162c into opensearch-project:main Jun 9, 2026
71 of 72 checks passed

dinujoh merged 1 commit into opensearch-project:main from dinujoh:main
